Big new design Part 2 :) #307

JackKelly · 2021-10-29T10:20:52Z

Pull Request

Description

This is quite a big PR, sorry, because it's plugging together the new code!

The majority of the file changes are to get the changes to pass the linter! (Sorry, I really should've done that in a separate PR, if I'd understood how much more strict the new linter config would be!)

Broadly implements an updated version of the design first sketched out in #213 (comment)

Also implements / fixes some other issues which were blocking this PR:

load_solar_pv_data_from_gcs() should use fsspec and hence be able to load data from any compute environment #286
Assert there's no overlap between train, test and validation datetimes at end of split() function #299
Change DataSourceList into Manager; and maintain DataSources in a dict instead of a list? #298
Allow user to configure the frequency of the t0 datetimes in the config yaml #277
Pass command-line-arguments into prepare_ml_data.py #171
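
For illustration only, the overlap check described in #299 could look something like the sketch below. This is not the code in this PR: the function name `assert_no_overlap` and the dict-of-splits argument are hypothetical, and plain Python sets are used here to keep the example dependency-free.

```python
from datetime import datetime, timedelta
from itertools import combinations


def assert_no_overlap(splits):
    """Check that no datetime appears in more than one split.

    `splits` maps a split name (e.g. "train") to an iterable of datetimes.
    Hypothetical helper for illustration; not the function used in this PR.
    """
    # Compare every pair of splits: train/validation, train/test, validation/test.
    for (name_a, a), (name_b, b) in combinations(splits.items(), 2):
        overlap = set(a) & set(b)
        assert not overlap, (
            f"{len(overlap)} datetimes appear in both {name_a} and {name_b}"
        )


# Six half-hourly datetimes, divided into three disjoint splits.
start = datetime(2021, 1, 1)
half_hours = [start + timedelta(minutes=30 * i) for i in range(6)]
splits = {
    "train": half_hours[:2],
    "validation": half_hours[2:4],
    "test": half_hours[4:],
}
assert_no_overlap(splits)  # Passes silently because the splits are disjoint.
```

Running a check like this at the very end of the split step means a bad split fails immediately, rather than leaking test datetimes into training.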

The main bulk of this PR does several things (sorry that not all of these are strictly related to implementing issue #213!)

How Has This Been Tested?

No
Yes

Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings

JackKelly · 2021-10-29T15:02:19Z

Hi @jacobbieker. OK, I think this new PR is just about ready for review. It's still definitely a draft, and there are a bunch of things still to do (please see the tick-list at the top of this PR conversation), but please do take a look and comment on the broad shape of the thing 🙂

jacobbieker

I like how it looks, I think the design is good!

jacobbieker · 2021-10-29T15:41:06Z

nowcasting_dataset/data_sources/data_source.py

+        zipped = list(zip(t0_datetimes, x_locations, y_locations))
+        batch_size = len(t0_datetimes)
+
+        with futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
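
The hunk above is cut off at the `with` line. As a rough sketch of the pattern it sets up (one thread per example in the batch), something like the following would work. The names `get_example` and `get_batch` and the dict return value are placeholders for illustration, not necessarily what `data_source.py` does.

```python
from concurrent import futures


def get_example(t0_datetime, x_location, y_location):
    # Hypothetical stand-in for loading one example from disk.
    return {"t0": t0_datetime, "x": x_location, "y": y_location}


def get_batch(t0_datetimes, x_locations, y_locations):
    zipped = list(zip(t0_datetimes, x_locations, y_locations))
    batch_size = len(t0_datetimes)

    # One worker per example. ThreadPoolExecutor raises ValueError if
    # max_workers is 0, so this assumes a non-empty batch.
    with futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
        future_examples = [
            executor.submit(get_example, *coords) for coords in zipped
        ]
        # Read results in submission order so that the examples stay
        # aligned with the input datetimes and locations.
        return [future.result() for future in future_examples]


batch = get_batch(["2021-10-29T10:00", "2021-10-29T10:30"], [10, 20], [100, 200])
```

Calling `future.result()` also re-raises any exception thrown inside a worker, so a failed load surfaces in the calling thread.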


nowcasting_dataset/manager.py

tests/test_manager.py

scripts/prepare_ml_data.py

nowcasting_dataset/data_sources/data_source.py

peterdudfield

Great work @JackKelly.

You can really see the fruits of your labour when lots of things simplify nicely.

I've added quite a few minor comments, and the only major comment is https://github.com/openclimatefix/nowcasting_dataset/pull/307/files#r740808191, but perhaps I'm missing something there

I like that you left a few TODOs in there which aren't too critical. It's good to break up the PR with these.

JackKelly · 2021-11-02T14:40:56Z

OK! I'm going to go ahead and merge this now. All the tests pass and prepare_ml_data.py runs on my little laptop-like thing (pulling data from leonardo over 1gig LAN). It's not as fast as I would like but I suspect that's because it's running on a little laptop-like machine and pulling data over the LAN! I'll benchmark on leonardo when leonardo isn't quite so busy 🙂 If it's still too slow then I'll implement #311. And if that still isn't fast enough then I'll re-implement the sample-twice-from-each-disk-load thing.

I need to fix #325 but that shouldn't stop the script from being used.

JackKelly added 11 commits October 28, 2021 12:03

Making a start on the big new design! Sketched out the basic design i…

b102da8

…n prepare_ml_batches.py. Renamed DataSourceList to Manager. Started fleshing out Manager class.

Implement arg_logger decorator

63f0a2a

enable load_solar_pv_data to load from any compute environment. Fixes #…

663852d

…286

Successfully gets t0 datetimes

61be554

fix incorrect logger message

ff18699

successfully checks for CSV file

8d5043b

Check there is no overlap between split datetimes. Fixes #299

8bef05c

Successfully creates directories and spatial_and_temporal_locations_o…

4d28923

…f_each_example.csv for each split

tidy up check_directories

33318b3

Fix merge conflicts with main

856fe64

implement Manager._get_first_batches_to_create()

af0b8f7

JackKelly added the refactoring label Oct 29, 2021

JackKelly self-assigned this Oct 29, 2021

JackKelly changed the base branch from main to jack/big-new-design October 29, 2021 10:24

JackKelly linked an issue Oct 29, 2021 that may be closed by this pull request

"Big new design" for nowcasting_dataset #213

Closed

38 tasks

JackKelly added the enhancement label Oct 29, 2021

JackKelly added 5 commits October 29, 2021 12:59

start fleshing out Manager.create_batches()

b2387f3

Finish first rough draft of Manager.create_batches()

04d4fbb

Finally, a full complete draft of #213. Not yet tested

72e39c8

open DataSource

07db836

Delete datamodule.py and datasets.py

7004973

This was linked to issues Oct 29, 2021

Use independent processes for each "modality" #202

Closed

Remove PyTorch from the code #86

Closed

This was referenced Oct 29, 2021

Turn off temporal interpolation of numerical weather predictions (NWPs) #135

Closed

Re-implement test for getting daylight datetime index #310

Closed

Remove n_timesteps_per_batch and _cache from DataSources.

f896a5e

jacobbieker reviewed Oct 29, 2021

JackKelly added 2 commits October 29, 2021 17:00

Implement get_filesystem()

e3d1597

prepare_ml_data.py runs and successfully creates GSP batches!

af6707a